Lecture 8 addendum
==================

This note discusses two very technical points related to lecture 8. The first is a correction to my definition of LL(1); the second is a brief explanation of the term "LL(1)". You are not responsible for the material in this addendum.

Correction to definition of LL(1):
==================================

You will not be responsible for this, but I just wanted to correct the record. There is a slight problem in the definition of LL(1) that I gave in the notes. Consider this grammar (this is, by the way, from the Wikipedia entry for top-down parsing):

    S -> A a b
    A -> a | epsilon

According to our definition, this grammar is LL(1): the right-hand sides of the productions from A have FIRST(a) = { a } and FIRST(epsilon) = { • }, and these are non-overlapping. However, this grammar cannot be correctly parsed by recursive descent. Given that the first token in the input is a, parseA cannot reliably decide which production to use: if the input is aab, then it should use A -> a, but if it is ab, then A -> epsilon is the correct production. So we said this grammar is LL(1), but the parser we constructed will not parse it correctly.

What is the problem? When deciding which production to use in parseA, we failed to take into account that, when the next input token is a, both A -> a and A -> epsilon are valid choices. This is akin to the case where two right-hand sides have overlapping FIRST sets.

There is a way to account for this, using so-called FOLLOW sets. When we see an epsilon production such as A -> epsilon, we should ask this question: is there any string of grammar symbols derivable from S in which A is immediately followed by a? If so, then when we see the a, we can't know whether to consume it (using the A -> a production) or leave it alone (using the A -> epsilon production), so in that case we should report that the grammar is not LL(1).

The set of tokens t that can follow a non-terminal A in some string derived from the start symbol is called the follow set of A, written FOLLOW(A). The algorithm for computing follow sets can be found in any compiler textbook or online (I don't want to take the time to cover it in class). The correct definition of LL(1) takes follow sets into account when deciding how to deal with epsilon productions; the idea is just to add FOLLOW(A) to FIRST(A) if A is nullable. But I didn't want to get into this technicality, and you will not be responsible for it.
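For those who are curious about what the FOLLOW-based check looks like concretely, here is a small sketch, written in Python rather than in the notation we use in class. It computes nullable non-terminals, FIRST sets, and FOLLOW sets for the grammar above using the standard textbook fixpoint algorithms, and then applies the corrected LL(1) test: the lookahead set of a production A -> alpha is FIRST(alpha), plus FOLLOW(A) if alpha can derive epsilon, and the productions of a given non-terminal must have pairwise disjoint lookahead sets. All of the names (GRAMMAR, lookaheads, and so on) are invented for this note.

    # A hedged sketch of the corrected LL(1) check, hard-coded for the grammar
    #     S -> A a b        A -> a | epsilon
    # Epsilon is written as the empty tuple ().
    GRAMMAR = {
        "S": [("A", "a", "b")],
        "A": [("a",), ()],
    }
    START = "S"
    NONTERMS = set(GRAMMAR)
    END = "$"                       # end-of-input marker

    def nullable_set():
        """Non-terminals that can derive the empty string."""
        nullable, changed = set(), True
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                if nt not in nullable and any(all(s in nullable for s in rhs) for rhs in rhss):
                    nullable.add(nt)
                    changed = True
        return nullable

    def first_of(seq, first, nullable):
        """FIRST of a string of grammar symbols."""
        out = set()
        for s in seq:
            out |= first[s] if s in NONTERMS else {s}
            if s not in nullable:
                return out
        return out                  # every symbol in seq was nullable

    def first_sets(nullable):
        first, changed = {nt: set() for nt in NONTERMS}, True
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                for rhs in rhss:
                    new = first_of(rhs, first, nullable) - first[nt]
                    if new:
                        first[nt] |= new
                        changed = True
        return first

    def follow_sets(first, nullable):
        follow, changed = {nt: set() for nt in NONTERMS}, True
        follow[START].add(END)
        while changed:
            changed = False
            for nt, rhss in GRAMMAR.items():
                for rhs in rhss:
                    for i, s in enumerate(rhs):
                        if s not in NONTERMS:
                            continue
                        rest = rhs[i + 1:]
                        new = first_of(rest, first, nullable)
                        if all(x in nullable for x in rest):    # rest can vanish
                            new |= follow[nt]
                        if not new <= follow[s]:
                            follow[s] |= new
                            changed = True
        return follow

    def lookaheads(nt, rhs, first, follow, nullable):
        """Tokens on which a predictive parser would choose nt -> rhs."""
        toks = first_of(rhs, first, nullable)
        if all(s in nullable for s in rhs):     # rhs can derive epsilon
            toks |= follow[nt]
        return toks

    nullable = nullable_set()
    first = first_sets(nullable)
    follow = follow_sets(first, nullable)
    for nt, rhss in GRAMMAR.items():
        sets = [lookaheads(nt, rhs, first, follow, nullable) for rhs in rhss]
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:
                    print("not LL(1): productions of", nt, "clash on", sets[i] & sets[j])

On this grammar the sketch reports that the two productions for A clash on the token a, which is exactly the problem described above: FOLLOW(A) = { a }, so the corrected check rejects the grammar as not LL(1), even though the plain FIRST-set check accepts it.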
Why LL(1)?
==========

A student asked what the significance of the name "LL(1)" is. I didn't want to take class time on this, because it's a bit complicated to answer, but I'll answer it here for those who are interested.

First, the easy parts. The first "L" refers to the fact that parsing occurs during a left-to-right scan of the input. The "1" means that parsing decisions are based only on the next input token; if those decisions were based on the next *two* input tokens, we would call the grammar "LL(2)".

Now the hard part: the second "L". It stands for "leftmost derivation." Again, you are not responsible for this, but for those who are curious, here goes.

We need to think about how parse trees are constructed during a top-down parse. Instead of doing it the way I did it in lecture 7, think of it this way: suppose there is a global parse tree that we're constructing, and we add to it whenever we make a parsing decision. That is, at any given time we are looking to expand (that is, add a right-hand side as children of) some non-terminal in the parse tree.

We start with the parse tree consisting of just the start symbol S. As soon as we decide which production from S to use, we add its right-hand side as children of S. We then handle each of its symbols in turn, as described in class. Each time we decide what production to use for a given non-terminal, we add that production's right-hand side to the tree.

For example, consider the grammar

    A -> id | ( B )
    B -> A C
    C -> + A C | epsilon

and the input ( ( x ) ), where x is an id token. We start with the tree consisting of just A and immediately realize that we need to use the second production for A, so we add (, B, and ) as children of A (the trees below are drawn with children indented under their parent):

    A
      (
      B
      )

We match the first (, then call parseB. parseB realizes - it has no choice - that it has to use the production B -> A C, so it adds that to the tree:

    A
      (
      B
        A
        C
      )

and then calls parseA. parseA again decides on the production A -> ( B ) and adds this to the tree:

    A
      (
      B
        A
          (
          B
          )
        C
      )

It again matches the ( and calls parseB, which again expands the tree:

    A
      (
      B
        A
          (
          B
            A
            C
          )
        C
      )

and calls parseA. This time parseA matches the id and adds that:

    A
      (
      B
        A
          (
          B
            A
              id
            C
          )
        C
      )

parseA returns to parseB, which calls parseC. parseC doesn't see a +, so it uses the epsilon production:

    A
      (
      B
        A
          (
          B
            A
              id
            C
              epsilon
          )
        C
      )

We continue, eventually calling parseC again, which fills in the final part of the parse tree:

    A
      (
      B
        A
          (
          B
            A
              id
            C
              epsilon
          )
        C
          epsilon
      )

We're done. Now let's write down the *frontiers* of these parse trees at each step (with the trees drawn out like this, you can read off each frontier by reading the leaves left to right, treating epsilon as the empty string):

    A => ( B ) => ( A C ) => ( ( B ) C ) => ( ( A C ) C )
      => ( ( id C ) C ) => ( ( id ) C ) => ( ( id ) )

This is a derivation sequence, as defined in class today. What is notable about this derivation sequence is that, at each step, it is the *leftmost* non-terminal that is replaced by the right-hand side of a production. It is characteristic of top-down parsing that the order in which it decides what productions to use mimics a derivation with this property - a so-called leftmost derivation. The second "L" in "LL(1)" stands for "leftmost derivation."
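For those who want to see this concretely, here is a small sketch of a recursive-descent parser for this grammar, written in Python rather than in the notation we use in class. It maintains the global parse tree described above and prints the tree's frontier after every expansion; on the token list for ( ( x ) ) it prints exactly the derivation sequence shown above (minus the initial frontier A). The names (Node, Parser, expand, and so on) are invented for this note.

    # A hedged sketch: recursive descent that records each parsing decision in a
    # global parse tree and prints the frontier after each expansion.
    class Node:
        def __init__(self, label):
            self.label = label
            self.children = []

    def frontier(node):
        """Leaves of the (partial) parse tree, left to right; epsilon leaves vanish."""
        if not node.children:
            return [] if node.label == "epsilon" else [node.label]
        return [leaf for child in node.children for leaf in frontier(child)]

    class Parser:
        def __init__(self, tokens, root):
            self.tokens = tokens    # e.g. ["(", "(", "id", ")", ")"]
            self.pos = 0
            self.root = root        # the global parse tree we keep extending

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def match(self, tok):
            assert self.peek() == tok, "expected " + tok
            self.pos += 1

        def expand(self, node, symbols):
            """Record a parsing decision: add one child per right-hand-side symbol."""
            node.children = [Node(s) for s in symbols]
            print(" ".join(frontier(self.root)))    # one step of the derivation
            return node.children

        def parseA(self, node):
            if self.peek() == "id":                 # A -> id
                self.expand(node, ["id"])
                self.match("id")
            else:                                   # A -> ( B )
                _, b, _ = self.expand(node, ["(", "B", ")"])
                self.match("(")
                self.parseB(b)
                self.match(")")

        def parseB(self, node):
            a, c = self.expand(node, ["A", "C"])    # B -> A C (the only choice)
            self.parseA(a)
            self.parseC(c)

        def parseC(self, node):
            if self.peek() == "+":                  # C -> + A C
                _, a, c = self.expand(node, ["+", "A", "C"])
                self.match("+")
                self.parseA(a)
                self.parseC(c)
            else:                                   # C -> epsilon
                self.expand(node, ["epsilon"])

    root = Node("A")
    Parser(["(", "(", "id", ")", ")"], root).parseA(root)

The thing to notice is that expand is always applied to the leftmost not-yet-expanded non-terminal in the tree, which is why the printed frontiers, read top to bottom, form a leftmost derivation.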